Penalized maximum likelihood estimation methods, the Baum-Welch algorithm and diagonal balancing of symmetric matrices for the training of acoustic models in speech recognition

ABSTRACT

A nonparametric family of density functions formed by histogram estimators is used to model acoustic vectors in automatic recognition of speech. A Gaussian kernel is set forth in the density estimator. When the densities have been found for all the basic sounds in a training stage, an acoustic vector is assigned the phoneme label corresponding to the highest likelihood; this assignment is the basis of the decoding of acoustic vectors into text.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention generally relates to methods of speech recognition and, more particularly, to nonparametric density estimation of high-dimensional data for use in training models for speech recognition.

2. Background Description

In the present invention, we are concerned with nonparametric density estimation of high-dimensional data. The invention is driven by its potential application to training speech data, where traditionally only parametric methods have been used. Parametric models typically lead to large scale optimization problems associated with a desire to maximize the likelihood of the data. In particular, mixture models of Gaussians are used for training acoustic vectors for speech recognition, and the parameters of the model are obtained by using K-means clustering and the EM algorithm, see F. Jelinek, Statistical Methods for Speech Recognition, The MIT Press, Cambridge, Mass., 1998. Here we consider the possibility of maximizing the penalized likelihood of the data as a means to identify nonparametric density estimators, see I. J. Good and R. A. Gaskins, “Nonparametric roughness penalties for probability densities,” Biometrika 58, pp. 255-77, 1971. We develop various mathematical properties of this point of view, propose several algorithms for the numerical solution of the optimization problems we encounter, and we report on some of our computational experience with these methods. In this regard, we integrate within our framework a technique that is central in many aspects of the statistical analysis of acoustic data, namely the Baum-Welch algorithm, which is especially important for the training of Hidden Markov Models, see again the book by F. Jelinek, cited above.

Let us recall the mechanism in which density estimation of high-dimensional data arises in speech recognition. In this context, a principal task is to convert acoustic waveforms into text. The first step in the process is to isolate important features of the waveform over small time intervals (typically 25 ms). These features, represented by a vector xεR^(d) (where d usually is 39), are then identified with context dependent sounds, for example, phonemes such as “AA”, “AE”, “K”, “H”. Strings of such basic sounds are then converted into words using a dictionary of acoustic representations of words. For example, the phonetic spelling of the word “cat” is “K AE T”. In an ideal situation the feature vectors generated by the speech waveform would be converted into a string of phonemes “K . . . K AE . . . AE T . . . T” from which we can recognize the word “cat” (unfortunately, a phoneme string seldom matches the acoustic spelling exactly).

One of the important problems associated with this process is to identify a phoneme label for an individual acoustic vector x. Training data is provided for the purpose of classifying a given acoustic vector. A standard approach for classification in speech recognition is to generate initial “prototypes” by K-means clustering and then refine them by using the EM algorithm based on mixture models of Gaussian densities, cf. F. Jelinek, cited above. Moreover, in the decoding stage of speech recognition (formation of Hidden Markov Models) the output probability density functions are most commonly assumed to be a mixture of Gaussian density functions, cf. L. E. Baum and J. A. Eagon, “An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model of ecology,” Bull. Amer. Math. Soc. 73, pp. 360-63, 1967; L. A. Liporace, “Maximum likelihood estimation for multivariate observations of Markov sources,” IEEE Trans. on Information Theory 5, pp. 729-34, 1982; R. A. Gopinath, “Constrained maximum likelihood modeling with Gaussian distributions,” Broadcast News Transcription and Understanding Workshop, 1998.

SUMMARY OF THE INVENTION

According to this invention, we adopt the commonly used approach to classification and think of the acoustic vectors for a given sound as a random variable whose density is estimated from the data. When the densities are found for all the basic sounds (this is the training stage) an acoustic vector is assigned the phoneme label corresponding to the highest scoring likelihood (probability). This information is the basis of the decoding of acoustic vectors into text.

Since in speech recognition x is typically a high dimensional vector and each basic sound has only several thousand data vectors to model it, the training data is relatively sparse. Recent work on the classification of acoustic vectors, see S. Basu and C. A. Micchelli, “Maximum likelihood estimation for acoustic vectors in speech recognition,” Advanced Black-Box Techniques For Nonlinear Modeling: Theory and Applications, Kluwer Publishers (1998), demonstrates that mixture models with non-Gaussian mixture components are useful for parametric density estimation of speech data. We explore the use of nonparametric techniques. Specifically, we use the penalized maximum likelihood approach introduced by Good and Gaskins, cited above. We combine the penalized maximum likelihood approach with the use of the Baum-Welch algorithm, see L. E. Baum, T. Petrie, G. Soules and N. Weiss, “A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains,” The Annals of Mathematical Statistics 41, No. 1, pp. 164-71, 1970; Baum and Eagon, cited above, often used in speech recognition for training Hidden Markov Models (HMMs). (This algorithm is a special case of the celebrated EM algorithm as described in, e.g., A. P. Dempster, N. M. Laird and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Soc. 39(B), pp. 1-38, 1977.)

We begin by recalling that one of the most widely used nonparametric density estimators has the form $f_{n}(x) = \frac{1}{nh^{d}} \sum_{i \in Z_{n}} k\!\left( \frac{x - x^{i}}{h} \right), \quad x \in R^{d} \qquad (1)$

where Z_(n)={1, . . . , n}, k is some specified function, and {x^(i):iεZ_(n)} is a set of observations in R^(d) of some unknown random variable, cf. T. Cacoullos, “Estimates of a multivariate density,” Annals of the Institute of Statistical Mathematics 18, pp. 178-89, 1966; E. Parzen, “On the estimation of a probability density function and the mode,” Annals of Mathematical Statistics 33, pp. 1065-76, 1962; M. Rosenblatt, “Remarks on some nonparametric estimates of a density function,” Annals of Mathematical Statistics 27, pp. 832-37, 1956. It is well known that this estimator converges almost surely to the underlying probability density function (PDF) provided that the kernel k is strictly positive on R^(d), ∫_(R^(d)) k(x)dx=1, h→0, nh→∞, and n→∞. The problem of how best to choose n and h for a fixed kernel k for the estimator (1) has been thoroughly discussed in the literature, cf. L. Devroye and L. Györfi, Nonparametric Density Estimation, The L₁ View, John Wiley & Sons, New York, 1985.
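By way of illustration, the following is a minimal sketch of the estimator (1) with a Gaussian kernel; the Python/NumPy setting, the bandwidth h, and the synthetic data are illustrative assumptions and not part of the invention.

```python
import numpy as np

def kernel_density_estimate(x, data, h):
    """Estimator (1) at the point x: f_n(x) = (1/(n h^d)) sum_i k((x - x^i)/h),
    here with the Gaussian kernel k(u) = (2*pi)^(-d/2) exp(-|u|^2 / 2)."""
    n, d = data.shape
    u = (x - data) / h
    k = np.exp(-0.5 * np.sum(u * u, axis=1)) / (2.0 * np.pi) ** (d / 2.0)
    return k.sum() / (n * h ** d)

# toy illustration with n = 500 scalar observations of a standard normal
rng = np.random.default_rng(0)
sample = rng.standard_normal((500, 1))
print(kernel_density_estimate(np.array([0.0]), sample, h=0.3))
```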

In this invention, we are led, by the notion of penalized maximum likelihood estimation (PMLE), to density estimators of the form $f(x) = \sum_{i \in Z_{n}} c_{i}\, k\left( x, x^{i} \right), \quad x \in R^{d} \qquad (2)$

where k(x, y), x, yεR^(d) is the reproducing kernel in some Hilbert space H, cf. S. Saitoh, Theory of Reproducing Kernels and its Applications, Pitman Research Notes in Mathematical Analysis 189, Longman Scientific and Technical, Essex, UK, 1988.

Among the methods we consider, the coefficients in this sum are chosen to maximize the homogeneous polynomial $\Pi(Kc) := \prod_{i \in Z_{n}} \left( \sum_{j \in Z_{n}} K_{ij} c_{j} \right), \quad c = \left( c_{1}, \ldots, c_{n} \right)^{T}, \qquad (3)$

over the simplex

S ^(n) ={c:cεR ₊ ^(n) , e ^(T) c=1},  (4)

where e=(1, . . . ,1)^(T)εR^(n),

 R ₊ ^(n) ={c:c=(c ₁ , . . . ,c _(n))^(T) , c _(i)≧0, iεZ _(n)},  (5)

the positive orthant, K is the matrix

K={K _(ij)}_(i,jεZ_(n)) ={k(x ^(i) ,x ^(j))}_(i,jεZ_(n))   (6)

and accomplish this numerically by the use of the Baum-Welch algorithm, cf. L. E. Baum, T. Petrie, G. Soules and N. Weiss, cited above; L. E. Baum and J. A. Eagon, cited above. A polynomial in the factored form (3) appears in the method of deleted interpolation which occurs in language modeling, see L. R. Bahl, P. F. Brown, P. V. de Souza, R. L. Mercer, and D. Nahamoo, “A fast algorithm for deleted interpolation,” Proceedings Eurospeech 3, pp. 1209-12, 1991. In the context of geometric modeling such polynomials have been called lineal polynomials, see A. S. Cavaretta and C. A. Micchelli, “Design of curves and surfaces by subdivision algorithms,” in Mathematical Methods in Computer Aided Geometric Design, T. Lyche and L. Schumaker (eds.), Academic Press, Boston, 1989, 115-53, and C. A. Micchelli, Mathematical Aspects of Geometric Modeling, CBMS-NSF Regional Conference Series in Applied Mathematics 65, SIAM, Philadelphia, 1995. A comparison of the Baum-Welch algorithm to the degree raising algorithm, see C. A. Micchelli and A. Pinkus, “Some remarks on nonnegative polynomials on polyhedra,” in Probability, Statistics and Mathematics: Papers in Honor of Samuel Karlin, Eds. T. W. Anderson, K. B. Athreya and D. L. Iglehart, Academic Press, San Diego, pp. 163-86, 1989, which can also be used to find the maximum of a homogeneous polynomial over a simplex, will be made. We also elaborate upon the connection of these ideas to the problem of the diagonal similarity of a symmetric nonsingular matrix with nonnegative elements to a doubly stochastic matrix, see M. Marcus and M. Newman, “The permanent of a symmetric matrix,” Abstract 587-85, Amer. Math. Soc. Notices 8, 595; R. Sinkhorn, “A relationship between arbitrary positive matrices and doubly stochastic matrices,” Ann. Math. Statist. 38, pp. 439-55, 1964. This problem has attracted active interest in the literature, see M. Bacharach, “Biproportional Matrices and Input-Output Change,” Monograph 16, Cambridge University Press, 1970; L. M. Bergman, “Proof of the convergence of Sheleikhovskii's method for a problem with transportation constraints,” USSR Computational Mathematics and Mathematical Physics 1(1), pp. 191-204, 1967; S. Brualdi, S. Parter and H. Schneider, “The diagonal equivalence of a non-negative matrix to a stochastic matrix,” J. Math. Anal. and Appl. 16, pp. 31-50, 1966; J. Csima and B. N. Datta, “The DAD theorem for symmetric non-negative matrices,” Journal of Combinatorial Theory 12(A), pp. 147-52, 1972; G. M. Engel and H. Schneider, “Algorithms for testing the diagonal similarity of matrices and related problems,” SIAM Journal of Algorithms in Discrete Mathematics 3(4), pp. 429-38, 1982; T. E. S. Raghavan, “On pairs of multidimensional matrices,” Linear Algebra and Applications 62, pp. 263-68, 1984; G. M. Engel and H. Schneider, “Matrices diagonally similar to a symmetric matrix,” Linear Algebra and Applications 29, pp. 131-38, 1980; J. Franklin and J. Lorenz, “On the scaling of multidimensional matrices,” Linear Algebra and Applications 114/115, pp. 717-35, 1989; D. Hershkowitz, U. G. Rothblum and H. Schneider, “Classification of nonnegative matrices using diagonal equivalence,” SIAM Journal on Matrix Analysis and Applications 9(4), pp. 455-60, 1988; S. Karlin and L. Nirenberg, “On a theorem of P. Nowosad,” Mathematical Analysis and Applications 17, pp. 61-67, 1967; A. W. Marshall and I. Olkin, “Scaling of matrices to achieve specified row and column sums,” Numerische Mathematik 12, pp. 83-90, 1968; M. V. Menon and H. Schneider, “The spectrum of a nonlinear operator associated with a matrix,” Linear Algebra and its Applications 2, pp. 321-34, 1969; P. Nowosad, “On the integral equation Kf=1/f arising in a problem in communication,” Journal of Mathematical Analysis and Applications 14, pp. 484-92, 1966; U. G. Rothblum, “Generalized scaling satisfying linear equations,” Linear Algebra and Applications 114/115, pp. 765-83, 1989; U. G. Rothblum and H. Schneider, “Scalings of matrices which have prescribed row sums and column sums via optimization,” Linear Algebra and Applications 114/115, pp. 737-64, 1989; U. G. Rothblum and H. Schneider, “Characterization of optimal scaling of matrices,” Mathematical Programming 19, pp. 121-36, 1980; U. G. Rothblum, H. Schneider and M. H. Schneider, “Scaling matrices to prescribed row and column maxima,” SIAM J. Matrix Anal. Appl. 15, pp. 1-14, 1994; B. D. Saunders and H. Schneider, “Flows on graphs applied to diagonal similarity and diagonal equivalence for matrices,” Discrete Mathematics 24, pp. 205-20, 1978; B. D. Saunders and H. Schneider, “Cones, graphs and optimal scaling of matrices,” Linear and Multilinear Algebra 8, pp. 121-35, 1979; M. H. Schneider, “Matrix scaling, entropy minimization and conjugate duality. I. Existence conditions,” Linear Algebra and Applications 114/115, pp. 785-813, 1989; R. Sinkhorn, cited above; R. Sinkhorn and P. Knopp, “Concerning nonnegative matrices and doubly stochastic matrices,” Pacific J. Math. 21(2), pp. 343-48, 1967, and has diverse applications in economics, operations research, and statistics.

Several of the algorithms described here were tested numerically. We describe their performance both on actual speech data and data generated from various standard probability density functions. However, we restrict our numerical experiments to scalar data and will describe elsewhere statistics on word error rate on the Wall Street Journal speech data base, as used in S. Basu and C. A. Micchelli, cited above.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing and other objects, aspects and advantages will be better understood from the following detailed description of a preferred embodiment of the invention with reference to the drawings, in which:

FIG. 1 is a convergence plot of the iteration methods given by penalized MLE and Baum-Welch;

FIG. 2A is a plot of the Baum-Welch estimator for the multiquadric and the cosine matrix;

FIG. 2B is a plot of the degree raising estimator for the multiquadric and the cosine matrix;

FIG. 3 is a plot of the PMLE, the Parzen estimator and the actual probability density for n=500, h=0.3 and the Gaussian kernel;

FIG. 4 is a plot of the Baum-Welch density estimator, the Parzen estimator and the actual probability density for n=500, h=0.3 and the Gaussian kernel;

FIG. 5A is a plot of the Baum-Welch estimator for n=500, h=0.3 and the Gaussian kernel;

FIG. 5B is a plot of the Parzen estimator for n=500, h=0.3 and the Gaussian kernel;

FIG. 6A is a plot of the Baum-Welch estimator and the actual probability density for n=500, h=0.6 and the Gaussian kernel;

FIG. 6B is a plot of the Baum-Welch estimator and the actual probability density for n=2000, h=0.3 and the Gaussian kernel;

FIG. 7A is a plot of the Baum-Welch estimator for n=2000, h=0.3 and the spline kernel f_(i) for i=1;

FIG. 7B is a plot of the Baum-Welch estimator for n=2000, h=0.3 and the spline kernel f_(i) for i=2;

FIG. 8A shows plots of the histogram, the Baum-Welch estimator and the Parzen estimator for leaf 1 of ARC AA_1, dimension 0;

FIG. 8B shows plots of the histogram, the Baum-Welch estimator and the Parzen estimator for leaf 2 of ARC AA_1, dimension 0;

FIG. 8C shows plots of the histogram, the Baum-Welch estimator and the Parzen estimator for leaf 7 of ARC AA_1, dimension 0;

FIG. 8D shows plots of the histogram, the Baum-Welch estimator and the Parzen estimator for leaf 11 of ARC AA_1, dimension 0;

FIG. 8E shows plots of the histogram, the Baum-Welch estimator and the Parzen estimator for leaf 5 of ARC AA_1, dimension 25; and,

FIG. 8F shows plots of the histogram, the Baum-Welch estimator and the Parzen estimator for leaf 2 of ARC AA_1, dimension 25.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS OF THE INVENTION

Penalized Maximum Likelihood Estimation

Let x¹, . . . ,x^(n) be independent observations in R^(d) from some unknown random variable with probability density function (PDF) ƒ. The likelihood function of the data $L(f) := \prod_{i \in Z_{n}} f\left( x^{i} \right) \qquad (7)$

is used to assess the value of a specific choice for ƒ. In a parametric statistical context a functional form of ƒ is assumed. Thus, we suppose that ƒ=ƒ(•, a) where a is an unknown parameter vector that determines ƒ. For example, ƒ may be a mixture model of Gaussian PDFs with unknown means, covariances and mixture weights. The parametric Maximum Likelihood Estimator (MLE) point of view tries to maximize the likelihood function over a. Such a data dependent choice of a is then used to specify the density ƒ. If no information on ƒ is known (or assumed) then it is clear that the likelihood function can be made arbitrarily large with a PDF that is concentrated solely at the data points x^(i), iεZ_(n). When such density functions are used for classification algorithms the phenomenon of overtraining results. Thus, only the actual training data can be classified and no other vectors. To remedy this problem it has been suggested that one penalize the likelihood function for oscillations in its derivatives, see I. J. Good and R. A. Gaskins, cited above. It is this point of view which we study here for the purpose of classification of acoustic vectors in speech recognition.

Let us recall the setup for penalized likelihood estimation. We let H be a Hilbert space of real-valued functions on R^(d) for which point evaluation is a continuous linear functional. In other words, H is a reproducing kernel Hilbert space, cf. S. Saitoh, cited above. Therefore, there exists a real-valued function k(x,y), x, yεR^(d) such that for every xεR^(d) the function k(x,•) is in H and for every ƒ in H we have

ƒ(x)=<k(x, •), ƒ>  (8)

where <•,•> represents the inner product on H. Recall that for any x¹, . . . ,x^(n) in R^(d) the matrix

K={k(x ^(i) ,x ^(j))}_(i,jεZ_(n))   (9)

is symmetric and positive semi-definite. Moreover, when the point evaluation functionals at the data x¹, . . . ,x^(n) are linearly independent this matrix is positive definite.
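As a small illustration of the matrix (9), the sketch below builds the Gram matrix for a Gaussian kernel (one example of a nonnegative reproducing kernel) at a few points and checks that it is symmetric and positive semi-definite; the kernel and the data are placeholders chosen only for this example.

```python
import numpy as np

def gram_matrix(points, kernel):
    """Matrix K = {k(x^i, x^j)} of equation (9) for a given kernel function."""
    n = len(points)
    return np.array([[kernel(points[i], points[j]) for j in range(n)]
                     for i in range(n)])

# example nonnegative kernel: k(x, y) = exp(-|x - y|^2 / 2)
gaussian_kernel = lambda x, y: float(np.exp(-0.5 * np.sum((x - y) ** 2)))

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 2))                 # six points in R^2
K = gram_matrix(X, gaussian_kernel)
print(np.allclose(K, K.T))                      # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-12)    # positive semi-definite
```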

Corresponding to this Hilbert space H and data x¹, . . . ,x^(n) the penalized likelihood function is defined to be $P(f) = \left( \prod_{i \in Z_{n}} f\left( x^{i} \right) \right) e^{-\frac{1}{2}\|f\|^{2}} \qquad (10)$

where ∥•∥ is the norm on H. Maximizing this function over H for specific Hilbert spaces has been studied in I. J. Good and R. A. Gaskins, cited above; J. R. Thompson and R. Tapia, Nonparametric Probability Density Estimation, The Johns Hopkins University Press, Baltimore, 1978; J. R. Thompson and R. Tapia, Nonparametric Function Estimation, Modeling and Simulation, SIAM, Philadelphia, 1970. For example, the PMLE for scalar data relative to a Sobolev norm has been obtained in J. R. Thompson and R. Tapia, cited above.

Since our motivation in this paper comes from speech recognition, the value of n is typically 5,000 and the dimension d is 39. Moreover, density estimators are needed for as many as 4,000 groups of vectors, cf. S. Basu and C. A. Micchelli, cited above; F. Jelinek, cited above. Although the ideas from C. A. Micchelli and F. I. Utreras, “Smoothing and interpolation in a convex subset of a Hilbert space,” SIAM Journal of Scientific and Statistical Computing 9, pp. 728-46, 1988; C. A. Micchelli and F. I. Utreras, “Smoothing and interpolation in a convex subset of a semi-Hilbert space,” Mathematical Modeling and Numerical Methods 4, pp. 425-40, 1991, should be helpful to solve the large scale optimization problem of maximizing P over all PDFs in a suitably chosen Hilbert space, this seems to be computationally expensive. Thus, our strategy is to remove some of the constraints on ƒ and maximize the absolute value of P(ƒ) over all ƒεH. As we shall see, this is an easier task. After determining the parametric form of such an ƒ, we will then impose the requirement that it be a PDF.

Let us begin by recalling that the penalized likelihood function does indeed have a maximum over all of H. To this end, we first point out that there exists a positive constant C such that for all ƒ in H

|P(ƒ)|≦C∥ƒ∥^(n) e^(−½∥ƒ∥²).  (11)

Consequently, the function P is bounded above on H; let B denote its least upper bound. Let {ƒ_(l):lεN}, N={1,2, . . . } be any sequence of functions in H such that lim_(l→∞)P(ƒ_(l))=B. Then, for all but a finite number of elements of this sequence, we have that $\|f_{l}\| \leq 4\left( \frac{Cn!}{B} \right)^{1/n}. \qquad (12)$

Therefore some subsequence of {ƒ_(l):lεN} converges weakly in H to a maximum of P.

Theorem 1: Suppose H is a reproducing kernel Hilbert space with reproducing kernel k(•, •). If |P(ƒ)|=max{|P(g)|:gεH} for some ƒ in H then there exist constants c_(i), iεZ_(n) such that for all xεR^(d) $f(x) = \sum_{i \in Z_{n}} c_{i}\, k\left( x, x^{i} \right). \qquad (13)$

Proof: Let y_(i)=ƒ(x^(i)), iεZ_(n). Since ƒ maximizes |P(h)|, hεH, we conclude that y_(i)≠0 for iεZ_(n). Let g be any other function in H such that g(x^(i))=y_(i), iεZ_(n). By the definition of ƒ we have that

|g(x¹) . . . g(x^(n))|e^(−½∥g∥²) ≦|ƒ(x¹) . . . ƒ(x^(n))|e^(−½∥ƒ∥²)

from which we conclude that

∥ƒ∥=min{∥g∥:g(x ^(i))=y _(i) , iεZ _(n) , gεH}.

The fact that ƒ has the desired form now follows from a well-known analysis of this problem. We recall these details here. For any constants a₁, . . . ,a_(n) we have that $\sum_{i \in Z_{n}} a_{i} y_{i} = \left\langle \sum_{i \in Z_{n}} a_{i}\, k\left( x^{i}, \cdot \right),\; g \right\rangle \leq \left\| \sum_{i \in Z_{n}} a_{i}\, k\left( x^{i}, \cdot \right) \right\| \, \|g\|$

which implies that $\|g\| \geq \max\left\{ \frac{a^{T}y}{\left\| \sum_{i \in Z_{n}} a_{i}\, k\left( x^{i}, \cdot \right) \right\|} : a = \left( a_{1}, \ldots, a_{n} \right)^{T} \in R^{n} \right\}. \qquad (14)$

To achieve equality above we choose constants c₁, . . . ,c_(n) so that the function {tilde over (ƒ)} defined by $\tilde{f}(x) = \sum_{i \in Z_{n}} c_{i}\, k\left( x, x^{i} \right), \quad x \in R^{d}$

satisfies the equations y_(i)={tilde over (ƒ)}(x^(i)), iεZ_(n). Therefore, we have that

∥{tilde over (ƒ)}∥² =c ^(T) Kc

where K={k(x^(i),x^(j))}_(i,jεZ_(n)) and c=(c₁, . . . ,c_(n))^(T). Since Kc=y, where y=(y₁, . . . ,y_(n))^(T), we also have that $\frac{c^{T}y}{\left\| \sum_{i \in Z_{n}} c_{i}\, k\left( x^{i}, \cdot \right) \right\|} = \|\tilde{f}\|.$ Thus equality holds in (14) for the choice a=c, so that ∥g∥≧∥{tilde over (ƒ)}∥ for every interpolant g; since ƒ has minimal norm among the interpolants, ƒ={tilde over (ƒ)}, which has the desired form.

Although the PMLE method allows for a nonparametric view of density estimation it is interesting to see its effect on a standard parametric model, for instance a univariate normal with unknown mean and variance relative to the Sobolev norm on R. To this end, recall that the m-th Sobolev norm is defined to be

∥ƒ∥_(m) ²=∫_(R)|ƒ^((m))(t)|² dt.

For the normal density with mean μ and variance σ², given by $f(t) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(t - \mu)^{2}}{2\sigma^{2}}}, \quad t \in R$

we get that

∥ƒ∥_(m) ²=2ρ_(m)σ^(−2m−1)

where $\rho_{m} = \frac{2^{-2m-3}}{\pi}\,\Gamma\!\left( m + \frac{1}{2} \right).$

Then the PMLE estimates for the mean and variance are given by $\hat{\mu} = \frac{1}{n}\sum_{i \in Z_{n}} x_{i}$

and

{circumflex over (σ)}=v

where v is the unique positive root of the equation $v^{2m-1}\left( v^{2} - S^{2} \right) = \frac{2m+1}{n}\,\rho_{m}$

and $S^{2} = \frac{1}{n}\sum_{i \in Z_{n}} \left( x_{i} - \hat{\mu} \right)^{2}.$

Note that v is necessarily greater than S, but as n→∞ it converges to S.
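A numerical sketch of these estimates follows. It computes μ̂ and S², then finds the root v by bisection, using the facts noted above that the left hand side is negative at v=S and increases without bound; the bracketing interval, the penalty order m, and the test data are assumptions made only for this illustration.

```python
import numpy as np
from math import gamma, pi

def pmle_normal(data, m=1):
    """PMLE of mean and standard deviation for a univariate normal under the
    m-th Sobolev penalty: v solves v**(2m-1) * (v**2 - S**2) = (2m+1)*rho_m/n."""
    data = np.asarray(data, dtype=float)
    n = data.size
    mu = data.mean()
    S2 = np.mean((data - mu) ** 2)
    rho_m = 2.0 ** (-2 * m - 3) / pi * gamma(m + 0.5)
    rhs = (2 * m + 1) * rho_m / n

    f = lambda v: v ** (2 * m - 1) * (v * v - S2) - rhs
    lo, hi = np.sqrt(S2), np.sqrt(S2) + 1.0
    while f(hi) < 0:                      # the left side grows without bound
        hi *= 2.0
    for _ in range(80):                   # bisection on [S, hi]
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return mu, 0.5 * (lo + hi)

rng = np.random.default_rng(2)
print(pmle_normal(rng.normal(1.0, 2.0, size=500)))   # v slightly exceeds S
```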

Penalized Maximum Likelihood Estimators

In the previous section we provided a justification for our density estimator to have the form $f(x) = \sum_{i \in Z_{n}} c_{i}\, k\left( x^{i}, x \right), \quad x \in R^{d} \qquad (15)$

where k(x,y), x,yεR^(d) is a reproducing kernel for a Hilbert space of functions on R^(d). In general, this function is neither nonnegative on R^(d) nor does it have integral one. We take the point of view that in applications the kernel will be chosen rather than the Hilbert space H. Thus, we will only consider kernels which are nonnegative, that is, k(x,y)≧0, x,yεR^(d).

We note in passing that there are noteworthy examples of Hilbert spaces which have nonnegative reproducing kernels. For example, given any polynomial

q(t)=q ₀ +q ₁ t+ . . . +q _(m) t ^(m) , tεR  (16)

which has a positive leading coefficient and only negative zeros, it can be confirmed that the Hilbert space of functions ƒ on R with finite norm $\|f\|^{2} = \sum_{j=0}^{m} q_{j} \int_{R} \left( f^{(j)}(t) \right)^{2} dt \qquad (17)$

is a reproducing kernel Hilbert space with a nonnegative reproducing kernel.

With a given nonnegative kernel we now view the density ƒ in equation (15) as being parametrically defined and now discuss the problem of determining its coefficients c₁, . . . ,c_(n). Our first choice is to determine these parameters by substituting the functional form of (15) into the penalized maximum likelihood function (10) and maximizing the resulting expression. To this end, we let

b _(i)=∫_(R^(d)) k(x,x ^(i))dx, iεZ _(n)

and introduce the set

S ^(n)(b)={c:cεR ₊ ^(n) , b ^(T) c=1}.

We recall that the n×n matrix

K={k(x ^(i) ,x ^(j))}_(i,jεZ_(n))

is symmetric, positive definite, and has nonnegative elements. Under these conditions our concern is to maximize the function $P_{K}(c) = \Pi(Kc)\, e^{-\frac{1}{2} c^{T}Kc} \qquad (18)$

over the set S^(n)(b) where for any x=(x₁, . . . ,x_(n))^(T)εR^(n) we set $\Pi(x) = \prod_{i \in Z_{n}} x_{i}.$

We also use $\|x\|^{2} = \sum_{i \in Z_{n}} x_{i}^{2}$

for the euclidean norm of x. We adopt the convention that multiplication/division of vectors is done component-wise. In particular, when all the components of a vector y=(y₁, . . . ,y_(n))^(T) are nonzero we set $\frac{x}{y} := \left( \frac{x_{1}}{y_{1}}, \ldots, \frac{x_{n}}{y_{n}} \right).$

For the vector e/y we use the shorthand notation y⁻¹, set

x·y:=(x ₁ y ₁ , . . . ,x _(n) y _(n))^(T)

for multiplication of vectors and use the notation S^(n) for the set S^(n)(e) which is the standard simplex in R^(n). If ƒ is a scalar valued function and x a vector all of whose coordinates lie in its domain, we set ƒ(x):=(ƒ(x₁), . . . ,ƒ(x_(n)))^(T). When we write x≧y we mean this in a coordinate-wise sense.

Note that P_(K)(c)=P(ƒ) where ƒ is the function in equation (15) and P is the penalized likelihood function (10).

Theorem 2: Let K be a positive definite matrix with nonnegative elements. Then the function P_(K) has a unique maximum on any closed convex subset C of R₊ ^(n).

Proof: The existence of a maximum is argued just as in the case of the existence of the maximum of the penalized likelihood function P in equation (10) over the Hilbert space H. To demonstrate that the maximum is unique we let P_(∞) be the maximum value of P_(K) on C. Our hypothesis ensures that P_(∞) is positive. Suppose the function P_(K) attains P_(∞) at two distinct vectors c¹ and c² in C. Consequently, the vectors Kc¹ and Kc² are in intR₊ ^(n). Therefore, for any 0≦t≦1 the vector K(tc¹+(1−t)c²) is likewise in intR₊ ^(n). A direct computation of the Hessian of log P_(K) at x, denoted by ∇² log P_(K)(x), verifies for any vectors x,yεR^(n) with Kxε(R\{0})^(n) that $y^{T}\,\nabla^{2}\log P_{K}(x)\,y = -\left\| \frac{Ky}{Kx} \right\|^{2} - y^{T}Ky < 0.$

Choosing y=c¹−c² and x=tc¹+(1−t)c² we conclude from the above formula that the function log P_(K)(tc¹+(1−t)c²) is strictly concave for 0≦t≦1. But this contradicts the fact that its values at the endpoints are the same.

Remark: Observe that without the condition that K is nonsingular P_(K) will in general have more than one maximum. For example, when n=2, b=(1,1)^(T) and $K = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$

P_(K) is everywhere equal to e^(−½) on S²(b). In general, the proof above reveals the fact that if P_(K) achieves its maximum at two vectors c¹ and c² in C then K(c¹−c²)=0.

If the maximum of P_(K) on S^(n)(b) occurs in the interior of S^(n)(b) at the vector c then it is uniquely determined by the stationary equation

K(c−(Kc)⁻¹)=λb

where λ is a scalar given by the equation

λ=c ^(T) Kc−n.

In the case that K is a diagonal matrix, that is, K=diag(k₁₁, . . . ,k_(nn)) where k_(ii)>0, iεZ_(n), the maximum of P_(K) on S^(n)(b) occurs at the vector c whose coordinates are all positive and given by the equation $c_{i} = \frac{\lambda + \sqrt{\lambda^{2} + 4k_{ii}}}{2k_{ii}}, \quad i \in Z_{n}$

where λ is the unique real number that satisfies the equation $1 = \sum_{i \in Z_{n}} \frac{\lambda + \sqrt{\lambda^{2} + 4k_{ii}}}{2k_{ii}}. \qquad (19)$

We remark that the right hand side of equation (19) is an increasing function of λ which tends to zero as λ→−∞ and to infinity as λ→∞.
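A short sketch of this diagonal case follows, taking b=e so that (19) is the constraint that the coordinates sum to one: λ is located by bisection using the monotonicity just noted, and the coordinates c_i are then read off; the diagonal entries are illustrative.

```python
import numpy as np

def pmle_diagonal(k_diag):
    """Maximizer of P_K on the simplex for K = diag(k_diag), b = e:
    c_i = (lam + sqrt(lam**2 + 4*k_ii)) / (2*k_ii), with lam the root of (19)."""
    k = np.asarray(k_diag, dtype=float)
    c_of = lambda lam: (lam + np.sqrt(lam * lam + 4.0 * k)) / (2.0 * k)
    g = lambda lam: c_of(lam).sum() - 1.0     # increasing in lam

    lo, hi = -1.0, 1.0
    while g(lo) > 0.0:
        lo *= 2.0
    while g(hi) < 0.0:
        hi *= 2.0
    for _ in range(80):                        # bisection for lambda
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) < 0.0 else (lo, mid)
    return c_of(0.5 * (lo + hi))

c = pmle_diagonal([1.0, 2.0, 4.0, 8.0])
print(c, c.sum())                              # positive coordinates summing to one
```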

Theorem 3: Suppose K is an n×n positive definite matrix with nonnegative elements such that Ke=e. Then the vector $\frac{1}{n}e$

is the unique maximum of P_(K) on S^(n).

Proof: For any cεS^(n) the arithmetic geometric mean inequality implies that $\left\{ P_{K}(c) \right\}^{1/n} \leq \frac{1}{n} e^{T}Kc\; e^{-\frac{1}{2n} c^{T}Kc}.$

Moreover, since e^(T)Kc=e^(T)c=1 the Cauchy Schwarz inequality implies that 1≦nc^(T)Kc. Therefore $\left\{ P_{K}(c) \right\}^{1/n} \leq \frac{1}{n}\, e^{-\frac{1}{2n^{2}}} = \left\{ P_{K}\!\left( \frac{1}{n}e \right) \right\}^{1/n}.$

We now consider the following problem. Starting with a vector c=(c₁, . . . ,c_(n))^(T) in intS^(n)(b) which is not the maximum of P_(K), we seek a vector v=(v₁, . . . ,v_(n))^(T) in intS^(n)(b) such that P_(K)(v)>P_(K)(c). That is, “updating” c with v will increase P_(K) and therefore eventually converge to the maximum of P_(K) on S^(n)(b). To this end, we consider the quantity log P_(K)(v)−log P_(K)(c). We shall bound it from below by using Jensen's inequality which ensures for any vector aεintR₊ ^(n) and zεS^(n) that log z^(T)a≧z^(T) log a. To make use of this fact we define for any fixed iεZ_(n), vectors z^(i)εS^(n) and aεintR₊ ^(n) by the equations $\left( z^{i} \right)_{j} = \frac{K_{ij}c_{j}}{(Kc)_{i}}, \quad j \in Z_{n}$

and $a = {\frac{v}{c}.}$

Also, we set $y = {\sum\limits_{i \in Z_{n}}z^{i}}$

and note that the vector $\frac{1}{n}y$

is in intS^(n) since K is a nonsingular matrix with nonnegative elements and the vector c is in R₊ ^(n). Therefore, we get that $\begin{aligned} \log P_{K}(v) - \log P_{K}(c) &= \sum_{i \in Z_{n}} \log \frac{(Kv)_{i}}{(Kc)_{i}} - \frac{1}{2}v^{T}Kv + \frac{1}{2}c^{T}Kc \\ &= \sum_{i \in Z_{n}} \log \left( z^{i} \right)^{T}a - \frac{1}{2}v^{T}Kv + \frac{1}{2}c^{T}Kc \\ &\geq y^{T}\log a - \frac{1}{2}v^{T}Kv + \frac{1}{2}c^{T}Kc. \end{aligned}$

This inequality suggests that we introduce the auxiliary function

W(v)=y ^(T) log v−½v ^(T) Kv, vεintR ₊ ^(n)

and rewrite the above inequality in the form

log P _(K)(v)−log P _(K)(c)≧W(v)−W(c).

This function W is strictly concave on intR₊ ^(n) and tends to minus infinity on its boundary. Therefore it has a unique maximum in intS^(n)(b). Using the principle of Lagrange multipliers there exists a constant γ such that the vector v at which W takes its maximum in intS^(n)(b) is characterized by the equation

y/v−Kv−γb=0.

Taking the inner product of both sides of this equation with v and using the fact that vεintS^(n)(b) yields the equation

γ=n−v ^(T) Kv.

We formalize these remarks in the following theorem.

Theorem 4: Let K be a symmetric positive definite matrix with nonnegative elements and bεintR₊ ^(n). Given any vector cεintS^(n)(b) there exists a unique vector vεintS^(n)(b) satisfying the equations

v·Kv=K((Kc)⁻¹)·c−γ v·b  (20)

where

γ=n−v ^(T) Kv.  (21)

This vector has the property that P_(K)(v)>P_(K)(c) as long as c is not the unique maximum of P_(K) on S^(n)(b).

The function H defined by the equation H(c)=v, vεintS^(n)(b) maps intS^(n)(b) into itself. By the uniqueness of v satisfying (20) we see that H is continuous. The mapping H has a continuous extension to S^(n)(b) and by the Brouwer fixed point theorem has a fixed point in S^(n)(b) at some vector u. This vector satisfies the equation

u·Ku=K((Ku)⁻¹)·u−γu·b.

These equations by themselves do not define the unique maximum of P_(K) on S^(n)(b). If we call this vector c then it is the unique solution of the equations

K((Kc)⁻¹ −c)=γb−z,

where γ is a scalar given by the equation

γ=n−c ^(T) Kc

and z is a vector in R₊ ^(n) satisfying the equation z·c=0. To derive this fact we recall that a concave function ƒ has its maximum on S^(n)(b) at x if and only if there is a γεR and a zεR₊ ^(n) such that z^(T)x=0 and ∇ƒ(x)=γb−z. Specializing this general fact to the case at hand verifies the above fact. In general the maximum of P_(K) may occur on the boundary of S^(n)(b).

The iteration embodied in the above result is quite appealing as it guarantees an increase in the penalized likelihood function P_(K) at each step unless we have reached its maximum on S^(n)(b). However, computing the updated vector seems computationally expensive and so we consider other methods for maximizing P_(K) over S^(n)(b).

To this end, we recall that whenever a function F is a homogeneous polynomial with positive coefficients then the Baum-Welch algorithm says that the update formula $v = \frac{c \cdot \nabla F(c)}{\left( b \cdot c \right)^{T} \nabla F(c)}$

increases F on S^(n)(b), i.e., F(v)>F(c) whenever cεS^(n)(b) is not the maximum of F on S^(n)(b), see L. E. Baum and J. A. Eagon, cited above. Since P_(K) is not a polynomial, applying the Baum-Welch iteration to it will generally not increase it. One modification of the Baum-Welch iteration with which we have some computational experience starts with a positive parameter σ and defines $v = \frac{c \cdot K\left( (Kc)^{-1} - \sigma e \right)}{n - \sigma\, c^{T}Kc}.$

This update formula is inexpensive to use and our numerical experiments indicate that it gives good results provided that σ is chosen with care.
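To make the general update formula recalled above concrete, the sketch below applies it to the homogeneous polynomial F(c)=Π(Kc) of equation (3) with b=e, in which case it reduces to v = c·K^T((Kc)^(−1))/n; the kernel matrix and the number of iterations are illustrative assumptions.

```python
import numpy as np

def baum_welch_step(c, K):
    """Baum-Welch update v = c * grad F(c) / ((b*c)^T grad F(c)) for
    F(c) = prod_i (K c)_i and b = e, which simplifies to c * K^T((Kc)^(-1)) / n."""
    n = K.shape[0]
    return c * (K.T @ (1.0 / (K @ c))) / n

# illustrative symmetric positive definite matrix with nonnegative entries
K = np.array([[2.0, 1.0, 0.5],
              [1.0, 2.0, 1.0],
              [0.5, 1.0, 2.0]])
c = np.full(3, 1.0 / 3.0)
for _ in range(50):
    c = baum_welch_step(c, K)
print(c, c.sum(), np.prod(K @ c))   # c stays on the simplex; prod(Kc) does not decrease
```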

Using the strong duality theorem for convex programs, cf. P. Whittle, Optimization under Constraints, Wiley Interscience, 1971, the dual minimization problem for the primal convex program of minimizing −log P_(K) over the domain

W:=S ^(n)(b)∩A ⁻¹(intR ₊ ^(n))

is

min{−log P _(K)(y):yεW}=max{θ(u,t):uεR ₊ ^(n) , tεR}

where

θ(u,t):=min{½x ^(T) Ax−u ^(T) x+t(b ^(T) x−1):xεA ⁻¹(intR ₊ ^(n))}.

However, the dual problem does not seem to offer any advantage over the primal. Interior point methods seem to have a strong potential to handle the primal problem when K is a large sparse matrix. It is interesting to note that our primal problem falls into the problem class studied in K. C. Toh, “Primal-dual path-following algorithms for determinant maximization problems with linear matrix inequalities,” Computational Optimization and Applications, Kluwer Academic Publishers, Boston.

We end this section with a number of comments on the problem of maximizing P_(K) over the positive orthant R₊ ^(n). This problem has some unexpected connections with the problem of the diagonal similarity of a symmetric matrix with nonnegative elements to a doubly stochastic matrix. We record below the information we need about this problem.

The following lemma is well known, cf. M. Marcus and M. Newman, cited above; R. Sinkhorn, cited above, and is important to us. Recall that an n×n matrix A is said to be strictly copositive provided that x^(T)Ax>0 for all xεR₊ ^(n)\{0}, cf. C. A. Micchelli and A. Pinkus, cited above, and references therein.

Lemma 1: Let A be an n×n symmetric strictly copositive matrix. For any vector yεintR₊ ^(n) there exists a unique vector xεintR₊ ^(n) such that

x·Ax=y.  (22)

Proof: First we prove the existence of a vector x which satisfies (22). To this end, we follow A. W. Marshall and I. Olkin, cited above, and consider the set

H ^(n)(y):={u:uεintR ₊ ^(n), Π(u ^(y))=1},

where u^(y)=(u₁^(y₁), . . . ,u_(n)^(y_(n)))^(T), y=(y₁, . . . ,y_(n))^(T), and u=(u₁, . . . ,u_(n))^(T). By hypothesis, there exists some ρ>0 such that for all uεR₊ ^(n) the inequality u^(T)Au≧ρu^(T)u holds. Thus, the function ½u^(T)Au takes on its minimum value on H^(n)(y) at some z. Therefore, by Lagrange multipliers there is a constant μ such that

Az=μy/z.

Since 0<z^(T)Az=(e^(T)y)μ we see that μ>0 and so the vector we want is given by x=z/{square root over (μ)}. This establishes the existence of x. Note that the vector $u = \frac{x}{\left( \Pi\left( x^{y} \right) \right)^{\frac{1}{e^{T}y}}}$

is the unique vector in H^(n)(y) which minimizes ½u^(T)Au on H^(n)(y).

The uniqueness of x in (22) is established next. Suppose there are two vectors z¹ and z² which satisfy equation (22). Let B be an n×n matrix whose elements are defined to be $B_{ij} = \frac{z_{i}^{2}}{y_{i}} A_{ij} z_{j}^{2}, \quad i,j \in Z_{n}.$

Then (22) implies that

 Be=e,  (23)

and

Bv=v ⁻¹,  (24)

where v:=z¹/z². Let M:=max{v_(j):jεZ_(n)}=v_(r) and m:=min{v_(j):jεZ_(n)}=v_(s) for some r, sεZ_(n). From (23) and (24) we obtain that

$\begin{matrix}{\frac{1}{m} = {\frac{1}{v_{s}} = {({Bv})_{s} \leq M}}} & (25)\end{matrix}$

and also

$\begin{matrix}{\frac{1}{M} = {\frac{1}{v_{r}} = {({Bv})_{r} \geq {m.}}}} & (26)\end{matrix}$

Thus we conclude that Mm=1 and from (25) that (B(v−Me))_(s)=0. Since B_(ss)>0 we get m=v_(s)=M. Therefore, m=M=1 which means v=e; in other words, z¹=z².

We shall apply this lemma to the Baum-Welch update for the problem of maximizing the function P_(K) given in (18) over the positive orthant R₊ ^(n), but first we observe the following fact.

Theorem 5: Let K be a symmetric positive definite matrix with nonnegative entries. Then P_(K) achieves its (unique) maximum on R₊ ^(n) at the (unique) vector x=(x₁, . . . ,x_(n))^(T) in intR₊ ^(n) which satisfies the equation

x·Kx=e.  (27)

Proof: Let c be any vector in R^(n) such that Kcε(R\{0})^(n). Since

∇P _(K)(c)=P _(K)(c)K((Kc)⁻¹ −c),

we see that the vector which satisfies (27) is a stationary point of P_(K). We already proved that P_(K) is log concave and so the result follows.

Note that when K satisfies the hypothesis of Theorem 3 the vector x in equation (27) is e and moreover it furnishes the maximum of P_(K) on S^(n). In the next result we clarify the relationship between the minimum problem in Lemma 1 and the problem of maximizing P_(K) over R₊ ^(n). For this purpose, we use the notation H^(n) for the set H^(n)(e).

Theorem 6: Suppose that K is an n×n symmetric positive definite matrix with nonnegative elements. Let

p _(K):=max{P _(K)(c):cεR ₊ ^(n)}

and

q _(K):=min{½u ^(T) Ku:uεH ^(n)}.

Then $p_{K} = \left( \frac{2q_{K}}{ne} \right)^{n/2}.$

Proof: Let x be the unique vector in the interior of R₊ ^(n) such that

x·Kx=e.

Since x is a stationary point of P_(K) we have that $p_{K} = \Pi(Kx)\, e^{-\frac{1}{2}x^{T}Kx}.$

Moreover, the vector $u = \frac{x}{\left( {\Pi (x)} \right)^{1/n}}$

in H^(n) is the unique solution of the minimum problem which defines the constant q_(K). Thus, $2q_{K} = u^{T}Ku = \frac{n}{\left( \Pi(x) \right)^{2/n}}$

and $p_{K} = \frac{e^{-n/2}}{\Pi(x)}.$

Eliminating Π(x) from these equations proves the result.

This result justifies the exchange of the max and min in the following result. To this end, we introduce the semi-elliptical region

E _(K) ^(n) ={c:cεR ₊ ^(n) , c ^(T) Kc=1}.

Theorem 7: Suppose K is an n×n symmetric positive definite matrix with nonnegative elements. Then

max{min{u ^(T) Kc:uεH ^(n)}:cεE _(K) ^(n)}=min{max{u ^(T) Kc:cεE _(K) ^(n)}:uεH ^(n)}.

To prove this fact we use the following consequence of the arithmetic geometric inequality.

Lemma 2: For any xεR₊ ^(n) we have that$\left( {\Pi (x)} \right)^{1/n} = {\min {\left\{ {\frac{1}{n}u^{T}{x:{u \in H^{n}}}} \right\}.}}$

Proof: By the arithmetic geometric inequality we have $\begin{matrix}{{{\frac{1}{n}u^{T}x} \geq \left( {\prod\limits_{i \in Z_{n}}{u_{i}x_{i}}} \right)^{1/n}} = {\Pi (x)}^{1/n}} & (28)\end{matrix}$

provided that Π(u)=1. This inequality is sharp for the specific choice u=Π(x)^(1/n)x⁻¹.

Proof of Theorem 7: First, we observe that $\begin{aligned} p_{K}^{1/n} &= \max\left\{ \max\left\{ \Pi(Kc)^{\frac{1}{n}}\, e^{-\frac{1}{2n}c^{T}Kc} : c \in \sqrt{t}\,E_{K}^{n} \right\} : t \in R_{+} \right\} \\ &= \max\left\{ e^{-\frac{t}{2n}}\sqrt{t} : t \in R_{+} \right\} \cdot \max\left\{ \Pi(Kc)^{1/n} : c \in E_{K}^{n} \right\} \\ &= \frac{1}{\sqrt{ne}} \max\left\{ \min\left\{ u^{T}Kc : u \in H^{n} \right\} : c \in E_{K}^{n} \right\}. \end{aligned}$

Next, we observe by the Cauchy Schwarz inequality that $\begin{aligned} \sqrt{\frac{2q_{K}}{ne}} &= \frac{1}{\sqrt{ne}} \min\left\{ \sqrt{u^{T}Ku} : u \in H^{n} \right\} \\ &= \frac{1}{\sqrt{ne}} \min\left\{ \max\left\{ u^{T}Kc : c \in E_{K}^{n} \right\} : u \in H^{n} \right\}. \end{aligned}$ Since Theorem 6 asserts that $p_{K}^{1/n} = \sqrt{2q_{K}/(ne)}$, equating the two displays proves the theorem.

We remark that (27) implies that xε{square root over (n)}E_(K) ^(n) when K satisfies the conditions of Theorem 5. Let us show directly that this is the case for any maximum of P_(K) on R₊ ^(n) even when K is only symmetric semi-definite. This discussion shall lead us to a Baum-Welch update formula for maximizing P_(K) on R₊ ^(n) even when K is only symmetric semi-definite.

Suppose v is any maximum of P_(K) in R₊ ^(n). Then for any t≧0

P _(K)(tv)=t ^(n)Π(Kv)e ^(−t²γ) ≦P _(K)(v)=Π(Kv)e ^(−γ)

where γ:=½v^(T)Kv. Thus the function h(t):=P_(K)(tv), tεR₊ achieves its maximum on R₊ at t=1. However, a direct computation confirms that its only maximum occurs at $t = \sqrt{\frac{n}{2\gamma}}.$

Thus, we have confirmed that vε{square root over (n)}E_(K) ^(n). Let$\begin{matrix}{\hat{c} = \frac{v}{\sqrt{n}}} & (29)\end{matrix}$

and observe that for every vector cεE_(K) ^(n)

 Π(Kc)≦Π(Kĉ).

Thus, the vector ĉ maximizes the function M_(K)(c):=Π(Kc) over E_(K) ^(n). Conversely, if a vector ĉ maximizes M_(K) over E_(K) ^(n), then the vector v defined by equation (29) maximizes P_(K) over R₊ ^(n). Thus it suffices to consider the problem of maximizing the homogeneous polynomial M_(K) over the semi-elliptical set E_(K) ^(n).

Before turning our attention to this problem, let us remark that the function M_(K) has a unique maximum on E_(K) ^(n) whenever K is a positive definite matrix with nonnegative elements. To see this, we observe that the Hessian of log M_(K) at any vector cεE_(K) ^(n) with KcεintR₊ ^(n) is negative definite. This fact is verified by direct computation as in the proof of Theorem 2, see also Theorem 10. Thus, M_(K) has a unique maximum on the convex set

Ê _(K) ^(n) ={c:cεR ₊ ^(n) , c ^(T) Kc≦1}.

Suppose that the maximum of M_(K) in Ê_(K) ^(n) occurs at u while its maximum on E_(K) ^(n) is taken at v. Then, by the homogeneity of M_(K) we have that

M _(K)(v)≦M _(K)(u)≦(u ^(T) Ku)^(n/2) M _(K)(v)≦M _(K)(v).

Hence u^(T)Ku=1 and v must be the maximum of M_(K) in Ê_(K) ^(n) as well. That is, u=v and v is unique. We now study the problem of maximizing M_(K) over E_(K) ^(n) in the following general form.

To this end, let G be any function of the form $G(x) = \sum_{i \in Z_{+}^{n}} g_{i} x^{i}, \quad x \in E_{K}^{n}$

where g_(i), iεZ₊ ^(n) are nonnegative scalars. We require that the sum be convergent in a neighborhood of E_(K) ^(n). Also, there should be at least one jεintZ₊ ^(n) such that g_(j)>0. Certainly the function M_(K) has all of these properties when K is a nonsingular matrix with nonnegative elements, which covers the case that interests us.

For xεE_(K) ^(n) we have that

x·∇G(x)≧jg _(j) x ^(j),

and so x·∇G(x)εintR₊ ^(n). Consequently, the vector z, defined by the equation $z = \frac{x \cdot \nabla G(x)}{x^{T} \nabla G(x)},$

is in intS^(n). Using Lemma 1 there is a unique vector y in intE_(K) ^(n) such that y·Ky=z. We claim that

G(x)≦G(y).  (30)

To confirm this we consider the function $Q(v) := \sum_{i \in Z_{+}^{n}} \frac{\left( i^{T}\log v \right) g_{i} x^{i}}{x^{T}\nabla G(x)} = z^{T}\log v, \quad v \in \mathrm{int}\, E_{K}^{n}.$

Since zεintR₊ ^(n), Q has a maximum in intE_(K) ^(n). Suppose that the maximum of Q occurs at wεintE_(K) ^(n). Thus, by Lagrange multipliers there is a constant ρ such that $\frac{z}{w} = \rho\, Kw.$

Since wεE_(K) ^(n) and zεS^(n) we have ρ=1. Thus, we see that w=y is the unique maximum of Q. Hence we have shown that

Q(y)≧Q(x)  (31)

where the inequality in (31) is strict unless y=x.

Next, we use Jensen's inequality to conclude that $\begin{aligned} \log \frac{G(y)}{G(x)} &= \log \left\{ \sum_{i \in Z_{+}^{n}} \frac{y^{i}}{x^{i}} \frac{g_{i} x^{i}}{G(x)} \right\} \\ &\geq \sum_{i \in Z_{+}^{n}} \left( \log \frac{y^{i}}{x^{i}} \right) \frac{g_{i} x^{i}}{G(x)} \\ &= \frac{\left( Q(y) - Q(x) \right) x^{T}\nabla G(x)}{G(x)}. \end{aligned}$

Hence, we have established (30) and moreover have demonstrated that equality holds in (30) if and only if y=x. That is, if and only if xεE_(K) ^(n) satisfies the equation $Kx = \frac{\nabla G(x)}{x^{T}\nabla G(x)}.$

This is the equation for the stationary values of G over the set E_(K) ^(n). In the case that interests us, namely, for G=M_(K), the stationary equations have a unique solution in E_(K) ^(n) at the maximum of M_(K). For this case we summarize our observations in the next result.

Theorem 8: Suppose K is a symmetric positive definite matrix with nonnegative elements. Given an xε{square root over (n)}intE_(K) ^(n) choose yε{square root over (n)}intE_(K) ^(n) such that

y·(Ky)=x·K((Kx)⁻¹).

Then P_(K)(y)>P_(K)(x) unless x=y.

Using this result we generate a sequence of vectors x^(k)ε{square root over (n)}intE_(K) ^(n), kεN by the equation

x ^(k+1) ·Kx ^(k+1) =x ^(k) ·K((Kx ^(k))⁻¹),  (32)

so that the sequence {x^(k):kεN} either terminates after a finite number of steps at the vector satisfying (27) or converges to it as k→∞. Moreover, in the latter case P_(K)(x^(k)) monotonically increases to the maximum of P_(K) in R₊ ^(n). This theorem gives a “hill climbing” iterative method of the type used to train Hidden Markov Models, cf. L. E. Baum, T. Petrie, G. Soules, and N. Weiss, cited above; F. Jelinek, cited above. However, equation (32) must be solved numerically. Therefore, it is much simpler to use equation (27) directly to find the maximum of P_(K) on the set R₊ ^(n).

A naive approach to find the unique solution to (27) is to define vectors c^(r), rεN by the iteration

c ^(r+1)=(Kc ^(r))⁻¹ , rεN.  (33)

In general this iteration will diverge. For example, when K=4I and the initial vector is chosen to be c¹=e, then for all rεN we have that c^(r)=2^((−1)^(r−1)−1) e, which alternates between e and ¼e and so does not converge to the maximum of P_(K), which occurs at the vector whose coordinates are all equal to one half.

To avoid this difficulty we consider the iteration $\begin{matrix}{{c^{r + 1} = \sqrt{\frac{c^{r}}{{Kc}^{r}}}},{r\quad \in \quad {N.}}} & (34)\end{matrix}$

Note that iteration (34) maps intR₊ ^(n) into itself. Therefore, we always initialize the iteration with a vector in intR₊ ^(n). The thought which led us to equation (34) was motivated by the apparent deficiencies of the iteration (33). Later we realized the connection of equation (34) to the Sinkhorn iteration, see R. Sinkhorn, cited above; R. Sinkhorn and P. Knopp, cited above, which we now explain.

We can rewrite this vector iteration as a matrix equation. To this end, we set

K _(ij) ^(r) :=c _(i) ^(r) K _(ij) c _(j) ^(r) , i,jεZ _(n) , rεN.

Then equation (34) implies that $K_{ij}^{r+1} = \frac{K_{ij}^{r}}{\sqrt{s_{i}^{r} s_{j}^{r}}}, \quad i,j \in Z_{n},\; r \in N$

where $s_{i}^{r} := \sum_{j \in Z_{n}} K_{ij}^{r}, \quad i \in Z_{n}.$

Thus, the components of the vectors c^(r), rεN generate a sequence of positive definite matrices K^(r), rεN, initialized with the matrix K, by a balanced column and row scaling. This iteration preserves symmetry of all the matrices, in contrast to the method of R. Sinkhorn, cited above; R. Sinkhorn and P. Knopp, cited above.
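A minimal sketch of iteration (34) and of the induced matrix scaling follows; the matrix, the starting vector c¹=e, and the stopping test (closeness of x·Kx to e) are illustrative choices.

```python
import numpy as np

def diagonal_balance(K, tol=1e-12, max_iter=10000):
    """Iteration (34): c <- sqrt(c / (K c)).  At a fixed point x we have
    x * (K x) = e, so diag(x) K diag(x) is doubly stochastic and, by
    Theorem 5, x maximizes P_K over the positive orthant."""
    c = np.ones(K.shape[0])
    for _ in range(max_iter):
        c = np.sqrt(c / (K @ c))
        if np.max(np.abs(c * (K @ c) - 1.0)) < tol:
            break
    return c

K = np.array([[2.0, 1.0, 0.5],
              [1.0, 3.0, 1.0],
              [0.5, 1.0, 2.0]])
x = diagonal_balance(K)
print(x * (K @ x))                                   # approximately the all-ones vector
print((np.diag(x) @ K @ np.diag(x)).sum(axis=1))     # row sums of the scaled matrix
```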

Next, we define for rεN, w^(r)=c^(r)·(Kc^(r)), m^(r)=min{w_(i) ^(r):iεZ_(n)}, M^(r)=max{w_(i) ^(r):iεZ_(n)} and observe the following fact.

Lemma 3: Let K be an n×n matrix with nonnegative entries. Then the sequence M^(r)/m^(r), rεN is nonincreasing. Moreover, if k_(ij)>0 for i,jεZ_(n) then it is strictly decreasing.

Proof: For any rεN we have the equations $\begin{matrix}\begin{matrix}{w^{r + 1} = {c^{r + 1} \cdot {Kc}^{r + 1}}} \\{= {\frac{c^{r}}{\sqrt{c^{r} \cdot {Kc}^{r}}} \cdot {K\left( \frac{c^{r}}{\sqrt{c^{r} \cdot {Kc}^{r}}} \right)}}} \\{= {\frac{c^{r}}{\sqrt{w^{r}}} \cdot {K\left( \frac{c^{r}}{\sqrt{w^{r}}} \right)}}}\end{matrix} & (35)\end{matrix}$

from which it follows that $\begin{matrix}\begin{matrix}{w^{r + 1} \leq \quad {\frac{1}{\sqrt{m^{r}}}{\frac{c^{r}}{\sqrt{w^{r}}} \cdot {Kc}^{r}}}} \\{\leq \quad {\frac{1}{\sqrt{m^{r}}}\sqrt{w^{r}}}}\end{matrix} & (36)\end{matrix}$

and also $\begin{matrix}{w^{r + 1} \geq {\frac{1}{\sqrt{M^{r}}}{\sqrt{w^{r}}.}}} & (37)\end{matrix}$

In particular, we get $\begin{matrix}{{M^{r + 1} \leq \sqrt{\frac{M^{r}}{m^{r}}}},} & (38)\end{matrix}$

$\begin{matrix}{m^{r + 1} \geq \sqrt{\frac{m^{r}}{M^{r}}}} & (39)\end{matrix}$

and we conclude that the ratio M^(r)/m^(r) is nonincreasing in r. Furthermore, if all the components of the vector c^(r) and all elements of the matrix K are positive we see that the sequence is strictly decreasing.

Note that the sequence of vectors c^(r), rεN given by (34) is bounded independent of r if k_(ii)>0, iεZ_(n). Specifically, we set κ=min{k_(ii):iεZ_(n)} and observe for rεN that c^(r)≦κ^(−½)e. In addition, the maximum norm satisfies the bound ∥c^(r)∥_(∞)≧ρ^(−½) where ρ is the largest eigenvalue of K. To see this we recall that $\rho = \max\left\{ \min\left\{ \frac{(Kx)_{i}}{x_{i}} : i \in Z_{n},\, x_{i} \neq 0 \right\} : x = \left( x_{1}, \ldots, x_{n} \right)^{T} \in R_{+}^{n}\backslash\{0\} \right\},$

cf. R. A. Horn and C. R. Johnson, Matrix Analysis, Cambridge University Press, Cambridge, U.K., 1985, page 504, from which the claim follows by induction.

The iteration (34) can be accelerated by computing intermediate vectors v^(r), rεN and setting $c_{i}^{r+1} = \sqrt{\frac{c_{i}^{r}}{v_{i}^{r}}}, \quad i \in Z_{n}$

where $v_{i}^{r} = \sum_{j=1}^{i-1} k_{ij} c_{j}^{r+1} + \sum_{j=i}^{n} k_{ij} c_{j}^{r}, \quad i \in Z_{n}.$

Maximum Likelihood Estimation

In this section we consider another method for determining the constants c₁, . . . , c_(n) for the density estimator given in equation (15). Here we take a parametric density estimation perspective. Thus, we study the problem of maximizing the likelihood function $\prod_{i \in Z_{n}} f\left( x^{i} \right)$

over the simplex S^(n) where ƒ has the functional form (15). In other words, we desire to maximize L_(K)(c)=Π(Kc) for all cεS^(n). A problem of this type arises in the method of deleted interpolation which is widely used in language modeling for speech recognition, see L. R. Bahl, P. F. Brown, P. V. de Souza, R. L. Mercer, D. Nahamoo, cited above. In fact, in this application the matrix K has more rows than columns. With this in mind we study the problem of maximizing L_(K) in greater generality than dealt with so far. To distinguish this case from the one considered in the previous section we adopt a slightly different notation in this section. We begin with an n×k matrix A with nonnegative entries and consider the problem of maximizing the homogeneous polynomial of total degree n given by

M _(A)(x)=Π(Ax), xεR ^(k)  (40)

over the simplex

S ^(k) ={x:xεR ₊ ^(k) , e ^(T) x=1}.

Note that even when A is a nonsingular square matrix the likelihood function M_(A) may take on its maximum on the boundary of S^(n). For example, corresponding to the matrix

the maximum of M_(A) occurs at c=(0,1)^(T)εS².

The fact that M_(A) is a homogeneous polynomial suggests that we use the Baum-Welch algorithm to maximize it over S^(k). To this end, we specialize the Baum-Welch algorithm to our context.

Theorem 9: Suppose A is an n×k matrix with nonzero rows and nonnegative elements. For every cεintS^(k) the vector ĉ whose coordinates are defined by the equation $\hat{c} = \frac{c}{n} \cdot A^{T}\left( (Ac)^{-1} \right)$

is likewise in intS^(k). Moreover, M_(A)(ĉ)≧M_(A)(c), where the inequality is strict unless M_(A) takes on its maximum on S^(k) at the vector c.

Proof: Since M_(A) is a homogeneous polynomial of degree n with nonnegative coefficients we can appeal to a result by L. E. Baum and J. A. Eagon, cited above, that states that the vector ĉ whose components are defined by the equation $\hat{c} = \frac{c \cdot \nabla M_{A}(c)}{n M_{A}(c)}$

has all the required properties. We leave it to the reader to verify that ĉ has the desired form.

Under the conditions of the above theorem we point out that the Baum-Welch algorithm globally converges.

Theorem 10: Let A be an n×k matrix of rank k satisfying the hypothesis of Theorem 9. Then M_(A) has a unique maximum on S^(k) and the Baum-Welch algorithm with any initial vector in intS^(k) converges to this maximum.

Proof: The main point of the proof is the observation that M_(A) is log concave since for any vectors c, y in R^(k) with Acε(R\{0})^(n) we have that $y^{T}\,\nabla^{2}\log M_{A}(c)\,y = -\left\| \frac{Ay}{Ac} \right\|^{2}. \qquad (41)$
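A sketch of the update of Theorem 9, iterated as Theorem 10 permits, is given below for a rectangular matrix of the kind arising in deleted interpolation; the matrix entries and the number of iterations are illustrative assumptions.

```python
import numpy as np

def baum_welch_update(c, A):
    """Update of Theorem 9: c_hat = (c / n) * A^T((A c)^(-1)); it maps int S^k
    into itself and does not decrease M_A(c) = prod_i (A c)_i."""
    n = A.shape[0]
    return (c / n) * (A.T @ (1.0 / (A @ c)))

# illustrative 5 x 3 matrix with nonnegative entries, nonzero rows and rank 3
A = np.array([[1.0, 2.0, 0.5],
              [0.2, 1.0, 1.5],
              [1.0, 0.3, 0.8],
              [0.6, 0.6, 0.6],
              [2.0, 0.1, 0.4]])
c = np.full(3, 1.0 / 3.0)
for _ in range(200):
    c = baum_welch_update(c, A)
print(c, c.sum(), np.prod(A @ c))   # c remains in the simplex; prod(Ac) does not decrease
```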

As in the case of our study of the penalized likelihood function P_(K) we give a sufficient condition on an n×n matrix A such that M_(A) achieves its maximum on S^(n) at the vector $\frac{1}{n}e.$

Theorem 11: Suppose A is an n×n symmetric matrix with nonnegative elements such that Ae=μe for some positive number μ. Then M_(A) has its unique maximum on S^(n) at the vector $\frac{1}{n}e.$

Proof: By the arithmetic geometric inequality we have for any cεS^(n) that $\left\{ M_{A}(c) \right\}^{1/n} \leq \frac{1}{n} e^{T}Ac = \frac{\mu}{n} = \left\{ M_{A}\!\left( \frac{1}{n}e \right) \right\}^{1/n}.$

In the spirit of the previous section, we describe a problem which is dual to the problem of maximizing M_(A) over S^(k). To present the result we introduce for every ρεR₊ the set

G _(ρ) ^(n) ={u:uεR ₊ ^(n), Π(u)≧1, e ^(T) u≦ρ},

and for ρ=∞ we denote this set by G^(n). We also need the set

S _(ρ) ^(k) ={c:cεS ^(k) , Ac≧ρe}.

Note that both of the sets G_(ρ) ^(n) and S_(ρ) ^(k) are nonempty convex and compact sets when ρ is not infinite.

Theorem 12: Let A be an n×k matrix with nonnegative elements such that each column has at least one nonzero element. Then $\left( \max\left\{ M_{A}(c) : c \in S^{k} \right\} \right)^{1/n} = \frac{1}{n}\min\left\{ \left\| A^{T}u \right\|_{\infty} : u \in G^{n} \right\}. \qquad (42)$

For this proof we use the minmax theorem of von Neumann, cf. C. Berge and A. Ghouila-Houri, Programming, Games and Transportation Networks, John Wiley, New York, pp. 65-66, 1965, which we repeat here for the convenience of the reader.

Theorem 13: Let U and V be compact nonempty convex sets in R^(n) and R^(k) respectively. If ƒ(x,y) is a function on R^(n)×R^(k) that is upper semi-continuous and concave with respect to x and lower semi-continuous and convex with respect to y then

max_(xεU) min_(yεV)ƒ(x,y)=min_(yεV) max_(xεU)ƒ(x,y).

Proof of Theorem 12: We first prove the theorem under the hypothesis that all the elements of A are positive. Therefore, for all ε sufficiently small

max{M _(A)(c):cεS ^(k)}=max{M _(A)(c):cεS _(ε) ^(k)}.

Hence, by Lemma 2 we have that $\left( \max\left\{ M_{A}(c) : c \in S^{k} \right\} \right)^{1/n} = \frac{1}{n}\max\left\{ \min\left\{ u^{T}Ac : u \in G^{n} \right\} : c \in S_{\varepsilon}^{k} \right\}.$

Using the fact that A(S_(ε) ^(k)) is a bounded subset of intR₊ ^(n) we get that the right hand side of the above equation equals $\frac{1}{n}\max\left\{ \min\left\{ u^{T}Ac : u \in G_{\rho}^{n} \right\} : c \in S_{\varepsilon}^{k} \right\}$

for all ρ sufficiently large. By the von Neumann min max theorem this quantity equals $\frac{1}{n}\min\left\{\max\left\{\left(A^{T}u\right)^{T}c:c\in S_{\varepsilon}^{k}\right\}:u\in G_{\rho}^{n}\right\}.$

Observe that for a fixed u the maximum over c in S^k occurs at a vector all of whose coordinates are zero except for one coordinate. Since A has all positive elements this vector will be in S_ε^k for all ε sufficiently small. Hence the right hand side of the above equation equals $\frac{1}{n}\min\left\{\|A^{T}u\|_{\infty}:u\in G_{\rho}^{n}\right\}.$

But for ρ sufficiently large this equals $\frac{1}{n}\min\left\{\|A^{T}u\|_{\infty}:u\in G^{n}\right\}.$

This establishes the result for the case that A has all positive elements. Any A that satisfies the hypothesis of the theorem can be approximated arbitrarily closely by matrices with all positive elements, and since both sides of equation (42) are continuous functions of A the result follows.

Degree Raising Instead of Baum Welch

In this section we compute the maximum of M_A over the simplex S^k using the notion of degree raising, cf. C. A. Micchelli, cited above. This method provides an interesting alternative to the Baum Welch algorithm described in the previous section and proceeds in two steps. The first step requires writing M_A as a linear combination of monomials. Thus, there are constants {a_i : i ∈ Z_n^k} such that $M_{A}(x) = \sum_{i\in Z_{n}^{k}}\binom{n}{i}a_{i}x^{i}, \quad x\in R^{k},$

where $Z_{n}^{k} := \left\{i=(i_{1},\ldots,i_{k})^{T} : i_{j}\in\{0,1,2,\ldots\},\ |i| = \sum_{j=1}^{k}i_{j} = n\right\}.$

The sequence {a_i : i ∈ Z_n^k} can be computed iteratively in the following way. For m ∈ Z_n we define sequences {b_i^m : i ∈ Z_m^k} by the formula $\prod_{j\in Z_{m}}(Ax)_{j} = \sum_{i\in Z_{m}^{k}}b_{i}^{m}x^{i}, \quad x\in R^{k}.$

Thus, for m ∈ Z_{n-1}, $b_{i}^{m+1} = \sum_{j=1}^{k}A_{m+1,j}\,b_{i-e^{j}}^{m}, \quad i\in Z_{m+1}^{k},$

where e¹, …, e^k are the coordinate vectors in R^k defined by e_r^j = δ_{jr}, j, r ∈ Z_k, and where b_{i-e^j}^m is taken to be zero whenever i-e^j has a negative component. The recursion is initialized with

b_{e^j}^1 = A_{1j}, j ∈ Z_k.

In particular, we have that $b_{i}^{n} = \binom{n}{i}a_{i}, \quad i\in Z_{n}^{k}.$

Note that by the multinomial theorem the quantity

$\|M_{A}\|_{\infty} = \max\{|M_{A}(x)| : x\in S^{k}\}$

does not exceed

$\max\{|a_{i}| : i\in Z_{n}^{k}\},$

so our “first guess” for ‖M_A‖_∞ is max{|a_i| : i ∈ Z_n^k}. Moreover, by the arithmetic geometric inequality the function x^i, x ∈ R^k, attains its maximum over S^k at x = i/n, and so our “first guess” for the maximizer of M_A will be x = j/n, where

j = argmax{|a_i| : i ∈ Z_n^k}.

To improve this initial guess, we use degree raising.

For any nonnegative integer r, the polynomial M_A(x)(x_1+⋯+x_k)^r is homogeneous of degree n+r and agrees with M_A on S^k. Hence, there are constants {a_i^r : i ∈ Z_{n+r}^k} such that $M_{A}(x)\left(x_{1}+\cdots+x_{k}\right)^{r} = \sum_{i\in Z_{n+r}^{k}}\binom{n+r}{i}a_{i}^{r}x^{i}, \quad x\in R^{k}.$

It is proved in C. A. Micchelli and A. Pinkus, cited above, that the sequence max{|a_i^r| : i ∈ Z_{n+r}^k} is nonincreasing in r ∈ N and converges to ‖M_A‖_∞ at the rate O(1/r). Moreover, if j_r ∈ Z_{n+r}^k is defined by

j_r = argmax{|a_i^r| : i ∈ Z_{n+r}^k},

then j_r/(n+r) is an approximation to

argmax{|M_A(x)| : x ∈ S^k}.

The sequence {a_i^r : i ∈ Z_{n+r}^k} can be generated recursively by the formula $a_{i}^{r+1} = \frac{1}{n+r+1}\sum_{j=1}^{k}i_{j}\,a_{i-e^{j}}^{r}, \quad i = (i_{1},\ldots,i_{k})^{T}\in Z_{n+r+1}^{k},$

where a_i⁰ = a_i, i ∈ Z_n^k, cf. C. A. Micchelli and A. Pinkus, cited above. From this formula we see that the sequence max{|a_i^r| : i ∈ Z_{n+r}^k} is nonincreasing, as stated above, and as r→∞ it converges to ‖M_A‖_∞ from above, while the Baum Welch algorithm generates a nondecreasing sequence which converges to ‖M_A‖_∞ from below.

We remark in passing that the sequence {a_i : i ∈ Z_n^k} can also be expressed in terms of the permanent of certain matrices formed from A by repeating its rows and columns, cf. I. P. Goulden and D. M. Jackson, Combinatorial Enumeration, John Wiley, New York, 1983; C. A. Micchelli, cited above.
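
Both steps can be sketched in a few lines of Python. This is a minimal illustration only: the 3×3 matrix A and the number of raising steps are arbitrary choices made for this example and do not come from the text, and no attempt is made at efficiency (the number of multi-indices grows quickly with n+r and k).

import numpy as np
from math import factorial

def multinomial(n, idx):
    # n! / (i_1! ... i_k!) for a multi-index idx with |idx| = n
    out = factorial(n)
    for i in idx:
        out //= factorial(i)
    return out

def bernstein_coefficients(A):
    # coefficients a_i with M_A(x) = sum_i binom(n, i) a_i x^i, keyed by multi-index tuples
    n, k = A.shape
    b = {tuple(int(r == j) for r in range(k)): A[0, j] for j in range(k)}   # b^1: coefficients of (Ax)_1
    for m in range(1, n):                        # multiply in the remaining factors (Ax)_2, ..., (Ax)_n
        nxt = {}
        for idx, val in b.items():
            for j in range(k):
                new = list(idx); new[j] += 1
                new = tuple(new)
                nxt[new] = nxt.get(new, 0.0) + A[m, j] * val
        b = nxt
    return {i: v / multinomial(n, i) for i, v in b.items()}

def degree_raise(a, degree):
    # one degree raising step: a_i^{r+1} = (1/(degree+1)) sum_j i_j a_{i-e^j}^r
    raised = {}
    for idx, val in a.items():
        for j in range(len(idx)):
            new = list(idx); new[j] += 1
            new = tuple(new)
            raised[new] = raised.get(new, 0.0) + (new[j] / (degree + 1)) * val
    return raised

A = np.array([[1.0, 2.0, 1.0],
              [0.5, 1.0, 3.0],
              [2.0, 0.5, 1.0]])                  # illustrative matrix, n = k = 3
n, k = A.shape
a, degree = bernstein_coefficients(A), n
for r in range(20):
    j = max(a, key=lambda i: abs(a[i]))
    print(r, abs(a[j]), np.array(j) / degree)    # upper bound for the maximum and the guess j_r/(n+r)
    a, degree = degree_raise(a, degree), degree + 1

The printed bounds decrease toward the maximum of M_A from above, mirroring the monotone behavior described above.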

Density Estimation with Baum Welch, Degree Raising, and Diagonal Scaling: A Numerical Comparison

We begin with the kernel $k(x,y) = \frac{1}{\left(1+\|x-y\|^{2}\right)^{2}}, \quad x,y\in R^{d}, \qquad (43)$

which is positive definite for all d. We restrict ourselves to R² and choose data arranged equally spaced on the unit circle, x^k = (cos 2πk/n, sin 2πk/n)^T, k ∈ Z_n. Note that the row sums of the matrix $K = \begin{pmatrix}k(x^{1},x^{1}) & \cdots & k(x^{1},x^{n}) \\ \vdots & \ddots & \vdots \\ k(x^{n},x^{1}) & \cdots & k(x^{n},x^{n})\end{pmatrix}$

are all the same. In our computation, we use the matrix A=cK, where the scalar c is chosen so that Ae=e. Therefore, by Theorem 11 the Baum Welch algorithm given by the iteration $c^{(k+1)} = \frac{c^{(k)}}{n}\cdot A\left(\left(Ac^{(k)}\right)^{-1}\right), \quad k\in N, \qquad (44)$

converges to ĉ=e/n. Similarly, the penalized likelihood iteration $c^{(k+1)} = \sqrt{\frac{c^{(k)}}{Ac^{(k)}}}, \quad k\in N, \qquad (45)$

will converge to ĉ.

Our numerical experience indicates that the performance of these iterations for moderate values of n, up to about 100, is insensitive to the initial starting vector. Therefore, as a means of illustration we choose n=4 and randomly pick as our starting vector c¹=(0.3353, 0.1146, 0.2014, 0.3478)^T. We plot in FIG. 1, for both iterations above, the error ε^(k)=‖c^(k)−e/n‖₂ on a log scale. One can see that the convergence is exponential, since the plotted curves are straight lines.
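
The experiment can be reproduced along the following lines; this is a minimal sketch in which the kernel (43), the scaling of A, the starting vector and the two iterations come from the text, while the number of iterations and the rescaling of the penalized likelihood iterate onto the simplex before measuring its error are our own choices.

import numpy as np

def kernel(x, y):
    # kernel (43): k(x, y) = 1 / (1 + ||x - y||^2)^2
    return 1.0 / (1.0 + np.sum((x - y) ** 2)) ** 2

n = 4
points = np.array([[np.cos(2 * np.pi * k / n), np.sin(2 * np.pi * k / n)] for k in range(n)])
K = np.array([[kernel(p, q) for q in points] for p in points])
A = K / K.sum(axis=1, keepdims=True)    # the row sums of K coincide here, so this is A = cK with Ae = e

e = np.ones(n)
c_bw = np.array([0.3353, 0.1146, 0.2014, 0.3478])    # starting vector c^1 from the text
c_pl = c_bw.copy()
for step in range(1, 26):
    c_bw = (c_bw / n) * (A @ (1.0 / (A @ c_bw)))     # Baum Welch iteration (44)
    c_pl = np.sqrt(c_pl / (A @ c_pl))                # penalized likelihood iteration (45)
    err_bw = np.linalg.norm(c_bw - e / n)
    err_pl = np.linalg.norm(c_pl / c_pl.sum() - e / n)   # rescaled to S^n before comparison
    print(step, err_bw, err_pl)

Plotting the printed error sequences on a log scale should reproduce the straight-line behavior shown in FIG. 1.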

In our next example, we compute the maximum of the homogeneous polynomial M_A on S³ where A is an n×3 matrix. Such a problem arises in language modeling, where n typically falls in the range from one million to eight hundred million, see L. R. Bahl, P. F. Brown, P. V. de Souza, R. L. Mercer and D. Nahamoo, cited above. In our numerical example we restrict ourselves to n=20, and consider the multiquadric matrix

$A = \left\{\left(1+|i-j|^{2}\right)^{1/2}\right\}_{i\in Z_{n},\,j\in Z_{3}}$

and the cosine matrix $A = \left\{\cos^{2}\left(\frac{2\pi(i+j)}{n}\right)\right\}_{i\in Z_{n},\,j\in Z_{3}}.$

In each case, we plot in FIGS. 2A and 2B log|M_A(c^(k))|, k ∈ N, for c^(k) generated by the Baum Welch algorithm and also by degree raising. Note the monotone behavior of the computed values. In the Baum Welch case, denoted by reference numeral 21, we choose c=e/3 as our initial vector. The degree raising iteration, denoted by reference numeral 22, does not require an initial vector. Note that in the first graph, in FIG. 2A, degree raising converges in one step, while in the second case, shown in FIG. 2B, Baum Welch does much better than degree raising.

We now turn our attention to univariate density estimation. We compare the Parzen estimator to the estimators generated by the Baum Welch algorithm applied to maximizing M_K on S^n and to the PML estimator obtained by maximizing P_M on R₊^n, rescaled to S^n. Although we have examined several typical density functions, including the chi-squared, uniform and logarithmic densities, we restrict our discussion to two bimodal densities generated from a mixture of two gaussians.

Recall that the Parzen estimator is given by $\frac{1}{nh}\sum_{i=1}^{n}k\left(\frac{x-x^{i}}{h}\right), \quad x\in R.$

In our numerical examples we choose n=500, h=0.3, and $k(x) = \frac{1}{\sqrt{2\pi}}e^{-x^{2}/2}$, x ∈ R. In FIG. 3 the actual bimodal density is denoted by reference numeral 31, while the Parzen estimator is denoted by reference numeral 32 and the PML estimator is denoted by reference numeral 33.
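
A small Parzen computation along these lines is sketched below; the two-gaussian mixture used to generate the sample is an illustrative assumption, since the text does not list the mixture parameters.

import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # illustrative bimodal target: equal mixture of N(-1.5, 0.5^2) and N(1.5, 0.5^2) (assumed parameters)
    comp = rng.integers(0, 2, size=n)
    return np.where(comp == 0, rng.normal(-1.5, 0.5, n), rng.normal(1.5, 0.5, n))

def parzen(x, data, h):
    # Parzen estimator: (1/(n h)) sum_i k((x - x^i)/h) with the gaussian kernel k
    u = (x[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (len(data) * h * np.sqrt(2.0 * np.pi))

n, h = 500, 0.3
data = sample(n)
grid = np.linspace(-4.0, 4.0, 201)
estimate = parzen(grid, data, h)    # values of the Parzen estimate on a grid, e.g. for plotting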

In FIG. 4, we display the Baum Welch estimator. It is interesting to note that the Baum Welch estimator, denoted by reference numeral 41, fits the peak of the true density, denoted by reference numeral 42, much better than the Parzen estimator, denoted by reference numeral 43, but it also displays more oscillatory behavior at 41a.

In FIGS. 5A and 5B, we choose a bimodal density with more discernible humps and compare graphically the behavior of the Baum Welch estimator, denoted by reference numeral 51, to the Parzen estimator, denoted by reference numeral 52.

All the methods above are sensitive to the kernel k, the bin size h and the sample size n. For example, choosing h to be 0.6, while keeping the same value n=500 and a gaussian kernel, gives the new Baum Welch estimator, denoted by reference numeral 61, shown together with the actual probability density, denoted by reference numeral 62, in FIGS. 6A and 6B.

Increasing the sample size to n=2000, while maintaining the value h=0.3 and a gaussian kernel, the Baum Welch estimator takes the form shown in FIGS. 6A and 6B.

Changing the kernel k produces more dramatic alterations in the density estimator. For example, we consider two spline kernels $k_{i}(x,y) = f_{i}\left(\frac{x-y}{h}\right)$, x, y ∈ R, where $f_{i}(x) = \begin{cases}b\left(a^{2}-x^{2}\right)^{i}, & |x| < a, \\ 0, & \text{otherwise},\end{cases}$

for i=1,2, where the constants a and b are chosen so that $\int_{-\infty}^{\infty}f_{1}(x)\,dx = \int_{-\infty}^{\infty}f_{2}(x)\,dx = 1.$
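
For example (a short computation supplied here, treating the width a as given and solving only the normalization for b):

$\int_{-a}^{a}\left(a^{2}-x^{2}\right)dx = \frac{4a^{3}}{3}\;\Rightarrow\; b = \frac{3}{4a^{3}}\ (i=1), \qquad \int_{-a}^{a}\left(a^{2}-x^{2}\right)^{2}dx = \frac{16a^{5}}{15}\;\Rightarrow\; b = \frac{15}{16a^{5}}\ (i=2).$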

The corresponding Baum Welch estimators, denoted by reference numeral 71, for h=0.3 and n=500, are displayed in FIGS. 7A and 7B. These graphs seem to indicate that the gaussian kernel works best.

We now apply the Baum Welch and the Parzen estimators to speech data and graphically compare their performance. The speech data is taken from the Wall Street Journal database for the sound AA (as in the word absolve, pronounced AE B Z AA L V). To explain what we have in mind, we recall how such data is generated.

Digitized speech sampled at a rate of 16 kHz is considered. A frame consists of a segment of speech of duration 25 msec and produces a 39-dimensional acoustic cepstral vector via the following process, which is standard in the speech recognition literature. Frames are advanced every 10 msec to obtain succeeding acoustic vectors.

First, the magnitudes of the discrete Fourier transform of the speech samples in a frame are considered on a logarithmically warped frequency scale. Next, these amplitude values are themselves transformed to a logarithmic scale and, subsequently, a rotation in the form of a discrete cosine transform is applied. The first 13 components of the resulting vector are retained. First and second order differences of the sequence of vectors so obtained are then appended to the original vector to obtain the 39-dimensional cepstral acoustic vector.
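
A rough sketch of such a front end is given below. It is only schematic: the Hamming window, the particular logarithmic warping (a crude stand-in for the warped frequency scale alluded to above), the number of warped bins, and the use of centered differences for the delta features are simplifications and assumptions, not the exact processing applied to the Wall Street Journal data.

import numpy as np
from scipy.fftpack import dct

def cepstral_features(signal, rate=16000, frame_ms=25, step_ms=10, n_ceps=13):
    # schematic cepstral front end: framing, warped log spectrum, DCT, difference features
    frame = int(rate * frame_ms / 1000)           # 400 samples per 25 msec frame at 16 kHz
    step = int(rate * step_ms / 1000)             # advance frames every 10 msec
    frames = [signal[i:i + frame] for i in range(0, len(signal) - frame + 1, step)]
    feats = []
    for f in frames:
        mag = np.abs(np.fft.rfft(f * np.hamming(frame)))
        bins = np.unique(np.geomspace(1, len(mag) - 1, 40).astype(int))   # logarithmic frequency warping
        logspec = np.log(mag[bins] + 1e-10)                               # logarithmic amplitude scale
        feats.append(dct(logspec, norm='ortho')[:n_ceps])                 # rotation by a discrete cosine transform
    feats = np.array(feats)
    d1 = np.gradient(feats, axis=0)               # first order differences
    d2 = np.gradient(d1, axis=0)                  # second order differences
    return np.hstack([feats, d1, d2])             # rows are 39-dimensional cepstral vectors

Applied to one second of audio, e.g. cepstral_features(np.random.randn(16000)), this returns roughly one hundred 39-dimensional vectors, one per 10 msec frame.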

As in supervised learning tasks, we assume that these vectors are labeled according to their corresponding basic sounds. In fact, the set of 46 phonemes is subdivided into a set of 126 different variants, each corresponding to a “state” in the hidden Markov model used for recognition purposes. These are further subdivided into more elemental sounds, called allophones or leaves, by using the method of decision trees depending on the context in which they occur; see, e.g., F. Jelinek, cited above; L. R. Bahl, P. V. de Souza, P. S. Gopalkrishnan and M. A. Picheny, “Context dependent vector quantization for continuous speech recognition,” Proceedings of IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 632-35, 1993; and Leo Breiman, Classification and Regression Trees, Wadsworth International, Belmont, Calif., 1983, for more details.

Among these most elemental sounds, known as leaves or allophones, we picked five distinct leaves and two distinct dimensions, all chosen from the vowel sound AA's first hidden Markov model state AA_1. The results of generating a histogram with 200 bins, the Baum Welch estimator and the Parzen estimator for six distinct choices of leaf and dimension pairs are displayed in FIGS. 8A to 8F. The Parzen estimator and the Baum Welch estimator both used the choice h=2.5n^{−1/3} and a gaussian kernel. The value of n was prescribed by the individual data sets; the values were 3,957, 4,526, 2,151, 4,898 and 1,183 for leaves 1, 2, 5, 7 and 11, respectively. The columns in FIG. 8 show, respectively, the histogram, the Baum Welch estimator and the Parzen estimator. It can be seen from these examples that the Baum Welch estimator captures finer details of the data than does the Parzen estimator for the same value of h.
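
As a worked instance of this bandwidth choice (not stated explicitly in the text), for leaf 1 with n=3,957 one obtains h = 2.5·3957^{−1/3} ≈ 2.5/15.8 ≈ 0.16.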

While the invention has been described in terms of preferredembodiments, those skilled in the art will recognize that the inventioncan be practiced with modification within the spirit and scope of theappended claims.

Having thus described our invention, what we claim as new and desire to secure by Letters Patent is as follows:
 1. A computer implemented method for machine recognition of speech, comprising the steps of: inputting acoustic data; forming a nonparametric density estimator $f_{n}(x) = \sum_{i\in Z_{n}}c_{i}\,k\left(x,x^{i}\right), \quad x\in R^{d},$ where Z_n = {1, 2, …, n}, k(x,y)

 is some specified positive kernel function, $c_{i}\geq 0,\ i\in Z_{n},\ \sum_{i=1}^{n}c_{i}=1$

 are parameters to be chosen, and {x^i}_{i∈Z_n} is a given set of training data; setting a kernel for the estimator; selecting a statistical criterion to be optimized to find values for parameters defining the nonparametric density estimator; and iteratively computing the density estimator for finding a maximum likelihood estimation of acoustic data.
 2. The computer implemented method for machine recognition of speech as recited in claim 1, wherein the statistical criterion to be optimized is selected from a penalized maximum likelihood criterion or a maximum likelihood criterion.
 3. The computer implemented method for machine recognition of speech as recited in claim 2, wherein the maximum likelihood criterion is selected.
 4. The computer implemented method for machine recognition of speech as recited in claim 3, wherein a process of degree raising of polynomials is used for global maximization.
 5. The computer implemented method for machine recognition of speech as recited in claim 3, wherein the step of iteratively computing employs a process of diagonal balancing of matrices to maximize the likelihood.
 6. The computer implemented method for machine recognition of speech as recited in claim 5, wherein a form of the iterative computation is $\hat{c} = \frac{c}{n}\cdot A^{T}\left(\left(Ac\right)^{-1}\right),$ where c=(c₁, c₂, …, c_n), c_i ≥ 0, $\sum_{i=1}^{n}c_{i}=1$

is an initial choice for the parameters, ĉ is the updated parameter choice, and K = {k(x^i, x^j)}_{i,j∈Z_n}, A=cK, where the scalar c is chosen such that Ae=e for e=(1, 1, …, 1).
 7. The computer implemented method for machine recognition of speech as recited in claim 2, wherein the penalized maximum likelihood criterion is selected.
 8. The computer implemented method for machine recognition of speech as recited in claim 7, wherein the step of iteratively computing employs a process of diagonal balancing of matrices to maximize the penalized likelihood.
 9. The computer implemented method for machine recognition of speech as recited in claim 8, wherein the step of iteratively computing the density estimator uses an update of parameters given as a unique vector v ∈ int S^n(b) satisfying v·Kv = K((Kc)⁻¹)·c − γ v·b, where b_i = ∫_{R^d} k(x,x^i) dx, b=(b₁, b₂, …, b_n), S^n(b) = {c : c ∈ R₊^n, b^T c = 1} and γ = n − v^T K v.
 10. The computer implemented method for machine recognition of speech as recited in claim 9, wherein the update parameter is given as $v = \frac{c\cdot K\left(\left(Kc\right)^{-1} - \sigma e\right)}{n - \sigma\,c^{T}Kc},$

where σ>0 is a parameter chosen to yield the best possible performance.
 11. The computer implemented method for machine recognition of speech as recited in claim 1, wherein the kernel is a Gaussian kernel.
 12. The computer implemented method for machine recognition of speech as recited in claim 1, wherein the kernel is given by the formula $k(x,y) = \frac{1}{\left(1+\|x-y\|^{2}\right)^{2}},$

x, y ∈ R^d, which is a reproducing kernel for a Hilbert space of functions on R^d.
 13. The computer implemented method for machine recognition of speech as recited in claim 1, wherein $c = \frac{e}{n}\quad\text{and}\quad k(x,y) = \frac{1}{h}\,k\left(\frac{x-y}{h}\right).$


14. The computer implemented method for machine recognition of speech as recited in claim 1, further comprising the step of assigning the maximum likelihood estimation to a phoneme label.
 15. The computer implemented method for machine recognition of speech as recited in claim 1, wherein the nonparametric density estimator has the form $f_{n}(x) = \frac{1}{nh^{d}}\sum_{i\in Z_{n}}k\left(\frac{x-x^{i}}{h}\right),$

x ∈ R^d, where Z_n = {1, …, n}, k is some specified function, and {x^i : i ∈ Z_n} is a set of observations in R^d of some unknown random variable.